
Macquarie University ResearchOnline

FastBLAST: Homology Relationships for Millions of Proteins

Author: A Marchler-Bauer
AA Schaffer
Adam P. Arkin
BE Suzek
Cecile Fairhead
CH Wu
CM Zmasek
D Wilson
F Pearl
H Mi
I Letunic
JD Selengut
LB Koski
M Remm
MN Price
Morgan N. Price
NJ Mulder
Paramvir S. Dehal
PS Dehal
R Durbin
RD Finn
RL Tatusov
S Yooseph
SF Altschul
W Gish
W Li
Publication venue: Public Library of Science
Publication date: 01/01/2008

BACKGROUND: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding. METHODOLOGY/PRINCIPAL FINDINGS: We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database ("NR"), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query. CONCLUSIONS/SIGNIFICANCE: FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast.

eScholarship - University of California

PhyloPat: phylogenetic pattern analysis of eukaryotic genes

Author: A Kasprzyk
C Minguillon
DA Natale
DL Wheeler
E Birney
F Al-Shahrour
F Chen
GP Wagner
H Li
Jacob de Vlieg
JF Dufayard
JO Korbel
K Reichard
M Ashburner
Peter MA Groenen
PS Dehal
R Fredriksson
RC Edgar
S Guindon
T Hulsen
TA Eyre
Tim Hulsen
V Matys
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Phylogenetic patterns show the presence or absence of certain genes or proteins in a set of species. They can also be used to determine sets of genes or proteins that occur only in certain evolutionary branches. Phylogenetic pattern analysis has routinely been applied to protein databases such as COG and OrthoMCL, but not to gene databases. Here we present a tool named PhyloPat, which allows the complete Ensembl gene database to be queried using phylogenetic patterns. DESCRIPTION: PhyloPat is an easy-to-use webserver that can be used to query the orthologies of all complete genomes within the EnsMart database using phylogenetic patterns. This enables the determination of sets of genes that occur only in certain evolutionary branches or even in single species. We found a total of 446,825 genes and 3,164,088 orthologous relationships within the EnsMart v40 database. We used a single linkage clustering algorithm to create 147,922 phylogenetic lineages, using all of the orthologies provided by Ensembl. PhyloPat can be queried with either binary phylogenetic patterns (created by checkboxes) or regular expressions. Specific branches of a phylogenetic tree of the 21 included species can be selected to create a branch-specific phylogenetic pattern. Users can also input a list of Ensembl or EMBL IDs to check which phylogenetic lineage any gene belongs to. The output can be saved in HTML, Excel or plain text format for further analysis. A link to the FatiGO web interface has been incorporated in the HTML output, creating easy access to functional information. Finally, lists of omnipresent, polypresent and oligopresent genes have been included. CONCLUSION: PhyloPat is the first tool to combine complete genome information with phylogenetic pattern querying. Since we used the orthologies generated by the accurate pipeline of Ensembl, the obtained phylogenetic lineages are reliable. The completeness and reliability of these phylogenetic lineages will further increase with the addition of newly found orthologous relationships within each new Ensembl release.
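The pattern querying described in this abstract can be illustrated with a minimal sketch. It assumes each phylogenetic lineage is stored as a binary string with one character per species (1 for present, 0 for absent); the species order, lineage IDs, and the function name `query_patterns` are hypothetical and are not taken from PhyloPat itself.

```python
import re

# Hypothetical species order: one pattern character per species.
SPECIES = ["human", "mouse", "chicken", "zebrafish", "fly"]

# Hypothetical lineage IDs mapped to presence (1) / absence (0) patterns.
LINEAGES = {
    "PL000001": "11111",
    "PL000002": "11000",
    "PL000003": "00001",
    "PL000004": "11100",
}


def query_patterns(lineages, regex):
    """Return sorted lineage IDs whose whole pattern matches the regex."""
    compiled = re.compile(regex)
    return sorted(
        lineage_id
        for lineage_id, pattern in lineages.items()
        if compiled.fullmatch(pattern)
    )


# Present in human and mouse, absent everywhere else.
mammal_only = query_patterns(LINEAGES, r"110{3}")

# Present in human, mouse and chicken; zebrafish may vary; absent in fly.
not_in_fly = query_patterns(LINEAGES, r"111[01]0")
```

A plain binary query is just the special case of a regular expression with no wildcards, which is one reason a single matching routine can plausibly serve both query styles.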

Public Library of Science (PLOS)

Radboud Repository

Defining Reference Sequences for Nocardia Species by Similarity and Clustering Analyses of 16S rRNA Gene Sequence Data

Author: A Gilat
A Krogh
A Roth
AH Fielding
BA Brown-Elliott
BE Dutilh
BT Grenfell
D Steinke
DA Benson
DM Nelson
F Kong
Fanrong Kong
H Manal
IH Witten
J Felsenstein
JE Clarridge III
JM Janda
JP Euzeby
KT Konstantinidis
KY Yeung
L Lancashire
Leonardo A. Sechi
LR McTaggart
M Helal
M Xiao
MA Saubolle
Manal Helal
MG Höfle
Michael Bain
NG Sgourakis
OG Pybus
P Agius
P Baldi
PD Hebert
PS Conville
PS Dehal
R Christen
R Edgar
R Karchin
RC Edgar
Richard Christen
SB Needleman
Sharon C. A. Chen
T Davidsen
T Frickey
V Savolainen
V Sintchenko
Vitali Sintchenko
VM Markowitz
W Li
WF Doolittle
Y Zhao
Publication venue: Public Library of Science
Publication date: 01/01/2011

BACKGROUND: The intra- and inter-species genetic diversity of bacteria and the absence of 'reference', or most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility and compare the performance of several clustering and classification algorithms for identifying the species of 364 16S rRNA gene sequences with a defined species in GenBank, and 110 16S rRNA gene sequences with no defined species, all within the genus Nocardia. METHODS: A total of 364 16S rRNA gene sequences of Nocardia species were studied. In addition, 110 16S rRNA gene sequences assigned only to the Nocardia genus level at the time of submission to GenBank were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm, linear mapping (LM) of the distance matrix. Principal Components Analysis was used for dimensionality reduction and visualization. RESULTS: The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species were identified as 'centroids' of their respective clusters, from which the distances to all other sequences were minimized. The 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578. CONCLUSION: The identification of centroids of 16S rRNA gene sequence clusters using novel distance matrix clustering enables the identification of the most representative sequences for each individual species of Nocardia and allows the quantitation of inter- and intra-species variability.
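The idea of a cluster 'centroid' as the most representative sequence can be sketched as follows. This is only an illustration under an assumption: "distances to all other sequences were minimized" is read here as the smallest summed distance to the other cluster members. The distance matrix and the function name `cluster_centroid` are hypothetical, and the study's own LM algorithm is not reproduced.

```python
def cluster_centroid(distances, members):
    """Return the member index with the smallest summed distance to the
    other members of the cluster (assumed reading of 'centroid')."""
    return min(members, key=lambda i: sum(distances[i][j] for j in members))


# Hypothetical symmetric pairwise distance matrix for four sequences.
D = [
    [0.00, 0.02, 0.03, 0.20],
    [0.02, 0.00, 0.01, 0.21],
    [0.03, 0.01, 0.00, 0.19],
    [0.20, 0.21, 0.19, 0.00],
]

# Sequence 1 is closest, in total, to the other members of cluster {0, 1, 2}.
centroid = cluster_centroid(D, [0, 1, 2])
```

Because the representative is always one of the input sequences, it can be reported directly as a reference sequence for its species, unlike an averaged profile.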

HAL-UNICE

Brunel University Research Archive

UNSWorks

Ultra-fast sequence clustering from similarity networks with SiLiX

Author: A Krishnamurthy
AJ Enright
AJ Vilella
AY Signorovitch
F Servant
H Li
HJ Atkinson
I Katriel
J Ruan
JL Boore
JM Joseph
KD Pruitt
Laurent Duret
MH Alsuwaiyel
PK Wall
PS Dehal
R Petryszak
R Tarjan
RD Finn
RE Tarjan
S Hartmann
S Hunter
S Penel
S Vishwanathan
SF Altschul
Simon Penel
SK Das
T Meinel
T Wittkop
Vincent Miele
Y Bramoulle
Y Han
Y Loewenstein
Y Tian
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: The number of gene sequences available for comparative genomics approaches is increasing extremely quickly. A current challenge is to handle this huge number of sequences in order to build families of homologous sequences in a reasonable time. RESULTS: We present the software package SiLiX, which implements a novel method that reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity. CONCLUSIONS: Compared with other state-of-the-art software, SiLiX offers the best current capabilities for clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.
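Single linkage clustering viewed as a graph problem amounts to finding the connected components of a similarity network. The sketch below uses a union-find structure for that purpose. It is not SiLiX's actual implementation: it assumes the BLAST hits have already been filtered into accepted pairs, and the function name and input format are hypothetical.

```python
def single_linkage_families(n_sequences, hits):
    """Group sequences 0..n_sequences-1 into families, where each accepted
    hit (a, b) links two sequences into the same family."""
    parent = list(range(n_sequences))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for seq in range(n_sequences):
        families.setdefault(find(seq), []).append(seq)
    return sorted(families.values())


# Hypothetical accepted hits between six sequences.
hits = [(0, 1), (1, 2), (3, 4)]
families = single_linkage_families(6, hits)
```

Each hit is processed once and never stored, so the memory cost depends on the number of sequences rather than on the number of hits, which matters when hits outnumber sequences by several orders of magnitude.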

INRIA a CCSD electronic archive server

Publikationer från Linköpings universitet

HAL Descartes

Genetic variants of CYP3A5, CYP2D6, SULT1A1, UGT2B15 and tamoxifen response in postmenopausal patients with breast cancer

Author: A Iida
A Westlind-Johnsson
AN Tucker
Bo Nordenskjöld
BS Katzenellenbogen
C Fabian
C Malet
CK Osborne
CS Murphy
D Lehmann
E Coezy
E Hustert
E Lévesque
Early Breast Cancer Trialist's Collaborative Group (EBCTCG)
ER Kisanga
F Jacolot
HK Crewe
I Koch
J MacCallum
JK Lamba
John Carstensen
L Gallicchio
MD Johnson
MP Goetz
MW Coughtrie
N Hanioka
Olle Stål
P Kuehl
P Wegman
Pia Wegman
PS Shih
RB Raftogianis
RHN van Schaik
S Nowell
SA Nowell
Sauli Elingarami
SS Dehal
Sten Wingren
T Nishiyama
V Stearns
VC Jordan
Y Jin
YC Lim
Publication venue: BioMed Central
Publication date: 01/01/2007

INTRODUCTION: Tamoxifen therapy reduces the risk of recurrence and prolongs the survival of oestrogen-receptor-positive patients with breast cancer. Even if most patients benefit from tamoxifen, many breast tumours either fail to respond or become resistant. Because tamoxifen is extensively metabolised by polymorphic enzymes, one proposed mechanism underlying the resistance is altered metabolism. In the present study we investigated the prognostic and/or predictive value of functional polymorphisms in cytochrome P450 3A5 (CYP3A5; *3), CYP2D6 (*4), sulphotransferase 1A1 (SULT1A1; *2) and UDP-glucuronosyltransferase 2B15 (UGT2B15; *2) in tamoxifen-treated patients with breast cancer. METHODS: In all, 677 tamoxifen-treated postmenopausal patients with breast cancer, of whom 238 were randomised to either 2 or 5 years of tamoxifen, were genotyped by using PCR with restriction fragment length polymorphism or PCR with denaturing high-performance liquid chromatography. RESULTS: The prognostic evaluation performed in the total population revealed a significantly better disease-free survival in patients homozygous for CYP2D6*4. For CYP3A5, SULT1A1 and UGT2B15 no prognostic significance was observed. In the randomised group we found that for CYP3A5, homozygous carriers of the *3 allele tended to have an increased risk of recurrence when treated for 2 years with tamoxifen, although this was not statistically significant (hazard ratio (HR) = 2.84, 95% confidence interval (CI) = 0.68 to 11.99, P = 0.15). In the group randomised to 5 years' tamoxifen the survival pattern shifted towards a significantly improved recurrence-free survival (RFS) among CYP3A5*3-homozygous patients (HR = 0.20, 95% CI = 0.07 to 0.55, P = 0.002). No reliable differences could be seen between treatment duration and the genotypes of CYP2D6, SULT1A1 or UGT2B15. The significantly improved RFS with prolonged tamoxifen treatment in CYP3A5*3 homozygotes was also seen in a multivariate Cox model (HR = 0.13, CI = 0.02 to 0.86, P = 0.03), whereas no differences could be seen for CYP2D6, SULT1A1 and UGT2B15. CONCLUSION: The metabolism of tamoxifen is complex and the mechanisms responsible for the resistance are unlikely to be explained by a single polymorphism; instead it is a combination of several mechanisms. However, the present data suggest that genetic variation in CYP3A5 may predict response to tamoxifen therapy.

Digitala Vetenskapliga Arkivet - Academic Archive On-line

NM-AIST Repository

ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process

Author: CJ Stubben
D Barker
D Haft
D Szklarczyk
DA Rodionov
Daniel H Haft
DH Haft
EM Marcotte
F Eckstein
F Enault
GV Glazko
H-Y Ou
J Sun
J Wu
J-P Vert
JAG Ranea
JD Selengut
Jeremy D Selengut
L Ferrer
M Csurös
M Huynen
M Pellegrini
MA Huynen
Malay K Basu
MS Gelfand
P Pagel
PM Bowers
PR Kensche
PS Dehal
R Jothi
RL Tatusov
S Briesemeister
S Freilich
SR Eddy
SV Date
T Blum
T Gaasterland
T Xu
T Yamada
X Brazzolotto
Y Hong
Y Liu
Y Zhou
Z Jiang
Publication venue: BioMed Central
Publication date: 01/01/2011

Shotgun sequencing of Yersinia enterocolitica strain W22703 (biotype 2, serotype O:9): genomic evidence for oscillation between invertebrates and mammals

BACKGROUND: Yersinia enterocolitica strains responsible for mild gastroenteritis in humans are very diverse with respect to their metabolic and virulence properties. Strain W22703 (biotype 2, serotype O:9) was recently found to possess nematocidal and insecticidal activity. To better understand the relationship between pathogenicity towards insects and humans, we compared the W22703 genome with that of the highly pathogenic strain 8081 (biotype 1B, serotype O:8), the only Y. enterocolitica strain sequenced so far. RESULTS: We used whole-genome shotgun data to assemble, annotate and analyse the sequence of strain W22703. Numerous factors assumed to contribute to enteric survival and pathogenesis, among them osmoregulated periplasmic glucan, hydrogenases, cobalamin-dependent pathways, iron uptake systems and the Yersinia genome island 1 (YGI-1) involved in tight adherence, were found to be common to the 8081 and W22703 genomes. However, sets of ~550 genes proved to be specific to each strain in comparison to the other. The plasticity zone (PZ) of 142 kb in the W22703 genome carries an ancient flagellar cluster Flg-2 of ~40 kb, but it lacks the pathogenicity island YAPIYe, the secretion systems ysa and yts1, and other virulence determinants of the 8081 PZ. Its composition underlines the prominent variability of this genome region and demonstrates its contribution to the higher pathogenicity of biotype 1B strains with respect to W22703. A novel type three secretion system of mosaic structure was found in the genome of W22703 that is absent in the sequenced strains of the human pathogenic Yersinia species, but conserved in the genomes of the apathogenic species. We identified several regions of difference in W22703 that mainly code for transporters, regulators, metabolic pathways, and defence factors. CONCLUSION: The W22703 sequence analysis revealed a genome composition distinct from other pathogenic Yersinia enterocolitica strains, thus contributing novel data to the Y. enterocolitica pan-genome. This study also sheds further light on the strategies of this pathogen to cope with its environments.

Public Library of Science (PLOS)

Evidence-Based Annotation of Gene Function in Shewanella oneidensis MR-1 Using Genome-Wide Fitness Profiling across 121 Conditions

Author: A Mitchell
A Typas
Adam Deutschbauer
Adam P. Arkin
AM Deutschbauer
AM Smith
B Christen
B Efron
B Rost
BJ Akerley
C Yang
CP Ewing
CR Myers
DE Cameron
DJ Burdige
E Alm
E Fischer
G Butland
G Giaever
GC Langridge
GE Pinchuk
GW Birrell
H Gao
H Ochman
HH Hau
HS Girgis
I Tagkopoulos
IM Keseler
J Oh
J Quan
JA Gralnick
Jason K. Baumohl
JD Gawronski
JD Peterson
JF Heidelberg
JJ Faith
JK Fredrickson
JL Groh
JR Warner
K Kobayashi
K Suzuki
Kelly M. Wetmore
KR Brocklehurst
KT Konstantinidis
L Binnenkade
LA Gallagher
M Hashimoto
M Huynen
ME Driscoll
ME Hillenmeyer
ME Kovach
Michelle Nguyen
MJ Lercher
MN Price
Morgan N. Price
MY Galperin
N Daraselia
N Ishii
NT Liberati
P Burghout
Paul M. Richardson
PS Dehal
PS Novichkov
Q Ren
R Bouhenni
R Zhang
RA Larsen
Raquel Tamse
RJ Nichols
RJ Roberts
RL Tatusov
RM Martinez
Ronald W. Davis
S Kuhner
S Kumari
S Weinitschke
SE Pierce
SJ Cooper
SK Sharan
SY Gerdes
T Baba
T van Opijnen
TR Hughes
V de Berardinis
Wenjun Shao
Y Liu
Zhuchen Xu
Publication venue: Public Library of Science
Publication date: 01/11/2011

Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes.

eScholarship - University of California

Glutamine versus Ammonia Utilization in the NAD Synthetase Family

NAD is a ubiquitous and essential metabolic redox cofactor which also functions as a substrate in certain regulatory pathways. The last step of NAD synthesis is the ATP-dependent amidation of deamido-NAD by NAD synthetase (NADS). Members of the NADS family are present in nearly all species across the three kingdoms of Life. In eukaryotic NADS, the core synthetase domain is fused with a nitrilase-like glutaminase domain supplying ammonia for the reaction. This two-domain NADS arrangement, which enables the utilization of glutamine as nitrogen donor, is also present in various bacterial lineages. However, many other bacterial members of the NADS family do not contain a glutaminase domain, and they can utilize only ammonia (but not glutamine) in vitro. A single-domain NADS is also characteristic of nearly all Archaea, and its dependence on ammonia was demonstrated here for the representative enzyme from Methanocaldococcus jannaschii. However, a question about the actual in vivo nitrogen donor for single-domain members of the NADS family remained open: is it glutamine hydrolyzed by a committed (but yet unknown) glutaminase subunit, as in most ATP-dependent amidotransferases, or free ammonia as in glutamine synthetase? Here we addressed this dilemma by combining evolutionary analysis of the NADS family with experimental characterization of two representative bacterial systems: a two-subunit NADS from Thermus thermophilus and a single-domain NADS from Salmonella typhimurium, providing evidence that ammonia (and not glutamine) is the physiological substrate of a typical single-domain NADS. The latter represents the most likely ancestral form of NADS. The ability to utilize glutamine appears to have evolved via recruitment of a glutaminase subunit followed by domain fusion in an early branch of Bacteria. Further evolution of the NADS family included lineage-specific loss of one of the two alternative forms and horizontal gene transfer events. Lastly, we identified NADS structural elements associated with glutamine-utilizing capabilities.